/*
21 December 2011

Demonstration of Heckman-type selection model to correct HIV prevalence estimates for selection bias due to non-participation in HIV tesing
in the Zambia 2007 Demographic and Health Survey

Please read accompanying document "Heckman selection model to correct HIV prevalence estimates for survey nonparticipation: Stata dofile - example DHS Zambia 2007.docx"

A detailed description of this approach is presented in the following article and online appendices: 

  B�rnighausen T, Bor J, Wandira-Kazibwe S, Canning D (2011). Correcting HIV prevalence estimates for survey nonparticipation using Heckman-type selection models. 
  Epidemiology. 2011;22(1):27-35.
  
  The results this dofile will produce are essentially the same as those shown in B�rnighausen et al. (2011). Minimal differences are due to the following 
  two changes: First, the DHS datasets have been updated since the work on the 2011 paper; second, we have used slightly different functional forms for a 
  few of the control variables in this dofile, to make them consistent with the functional forms we used in a recent replication of this approach for all 
  available DHS in sub-Saharan Africa:
  
  Hogan D, Salomon JA, Canning D, Hammitt JK, Zaslavksy A, B�rnighausen T. National HIV prevalence estimates for sub-Saharan Africa: revisiting the impact 
  of survey non-participation. (submitted) 2011.   

Please refer to DHS codebooks for variable descriptions after obtaining data (visit http://www.measuredhs.com/data/dataset/Zambia_Standard-DHS_2007.cfm)

Use of this computer program, or adaptations based on it, is free to all provided that any resulting publications cite the authors of:

 B�rnighausen T, Bor J, Wandira-Kazibwe S, Canning D (2011). Correcting HIV prevalence estimates for survey nonparticipation using Heckman-type selection models. 
  Epidemiology. 2011;22(1):27-35.
  and
   Hogan D, Salomon JA, Canning D, Hammitt JK, Zaslavksy A, B�rnighausen T. National HIV prevalence estimates for sub-Saharan Africa: revisiting the impact 
  of survey non-participation. (submitted) 2011.   

 as the source of the computer code.

*/

cd "/Users/chalta_hai/Documents/Harvard/Research/HIV Selection/Demo"

clear
cap clear matrix
cap log close
log using Zambia2007demo, replace
set mem 1000m
set matsize 11000
set maxvar 20000set more off
set seed 7

*************************************************************************************
*************************************************************************************
****** MERGING DATASETS
*************************************************************************************
*************************************************************************************

*Women individual questionnaire
use "ZMIR51FL.DTA", clear

duplicates report v001 v002 v003

gen long IID = (v001*100000)+ (v002*100) + v003

duplicates report IID

keep IID v001 v002 v003 v005 v012 v013 v015 v028 v190 v133 v026 v501 v502 v511 v525 v531 v761 ///
     v766a v766b v101 v130 v131 v751 v775 v778 v781 v763a v763b v763c slangi s1012a v463a v463b v463c  v844 v845 v846 v840 v843 v785

sort IID
save "zmir51fl_m.dta", replace 

*Men individual questionnaire
use "ZMMR51FL.DTA", clear 

duplicates report mv001 mv002 mv003

gen long IID = mv001*100000 + mv002*100 + mv003

duplicates report IID

keep IID mv001 mv002 mv003 mv005 mv012 mv013 mv015 mv028 mv190 mv133 mv026 mv501 mv502 mv511 mv525 ///
	 mv531 mv761 mv766a mv766b mv101 mv130 mv131 mv483 mv751 mv775 mv778 mv781 ///
	 mv763a mv763b mv763c mv767a mv767b mv767c mv793 smlangi sm812 mv463a mv463b mv463c mv785 mv844 mv845 mv846

sort IID
save "zmmr51fl_m.dta", replace 

*Household questionnaire
use "ZMPR51FL.DTA", clear

duplicates report hv001 hv002 hvidx

gen long IID = hv001*100000 + hv002*100 + hvidx

duplicates report hv001 hv002 hvidx

keep IID hv001 hv002 hvidx hv003 hv005 hv006 hv016 hv017 hv018 hv021 hv022 hv024 hv025 hv026 hv030 ///
     hv103 hv104 hv105 hv106 hv107 hv108 hv117 hv118 hv270 hv271 ///
     hml16 ha61 ha63 ha64 hb61 hb63 hb64

sort IID
save "zmpr51fl_m.dta", replace

*HIV test results
use "ZMAR51FL.DTA", clear

duplicates report hivclust hivnumb hivline

gen long IID = hivclust*100000 + hivnumb*100 + hivline

duplicates report IID

sort IID
save "zmar51fl_m.dta", replace


***** Now merge the data sets
*Individual women
use "zmir51fl_m.dta", clear

*Individual men
merge 1:1 IID using "zmmr51fl_m.dta"
tab _merge
rename _merge _merge1
sort IID

*Household
merge 1:1 IID using "zmpr51fl_m.dta"
tab _merge
rename _merge _merge2
sort IID

*HIV
merge 1:1 IID using "zmar51fl_m.dta"
tab _merge
rename _merge _merge3
save "Zambia_master.dta", replace

duplicates report IID


*************************************************************************************
*************************************************************************************
****** GENERATING NEW VARIABLES
*************************************************************************************
*************************************************************************************

use "Zambia_master.dta", clear

*Sex of respondent
rename hv104 sex
label values sex lblsex
label define lblsex 1 "male" 2 "female"

*** Survey sampling weights
gen hhweight = hv005/1000000
gen hivweight = hiv05/1000000
gen wiweight = v005/1000000
gen miweight = mv005/1000000


*** Samples

*Eligible individuals
gen universe = 1 if hv117 == 1 | hv118 ==1 
replace universe = 0 if hv117 ==0 & hv118 ==0
tab universe sex, m

*Consent to HIV test
gen hivconsent = 1 if (hiv03==1 | hiv03==0) & universe ==1 
replace hivconsent = 0 if (hiv03!=1 & hiv03!=0) & universe ==1

*Individual interview respondents
gen interview= 1 if (v015==1 | mv015 ==1) & universe ==1
replace interview= 0 if (v015!=1 & mv015!=1) & universe ==1

* Outcome (HIV infection status)
gen hiv = 1 if hiv03 ==1 & universe ==1
replace hiv= 0 if hiv03 ==0 & universe ==1


*************************************************************************************
****** INDEPENDENT VARIABLES
*************************************************************************************

*Urban/rural
gen urban = 1 if hv025 ==1
replace urban =0 if hv025 ==2

*Education
gen education = hv108
replace education = . if education == 98 | education == 99
tab education hv106 if universe==1, m

*Age
gen age = hv105 if hv105!=99 & hv105!=98
egen agecat5=cut(age), at(15 20 25 30 35 40 45 50 55 60) label

*Wealth
gen wealthcat = hv270

*Location
gen location = hv026

*Region
rename hv024 region

*Marriage (never/currently/former)
gen marital = v502 if sex == 2
replace marital = mv502 if sex == 1
tab marital sex if interview==1, m

*STDs
* women
gen wstd=1 if v763a==1 | v763b==1 | v763c==1
replace wstd=0 if v763a==0 & v763b==0 & v763c==0
replace wstd=0 if v763a==0 & (v763b!=1 & v763c!=1)                     /*Assume response to question "a" informs questions "b" and "c" when those responses are missing*/

* men
gen mstd=1 if mv763a==1 | mv763b==1 | mv763c==1
replace mstd=0 if mv763a==0 & mv763b==0 & mv763c==0
replace mstd=0 if mv763a==0 & (mv763b!=1 & mv763c!=1) 

* combine
gen std = wstd if sex==2
replace std = mstd if sex ==1
tab std sex if interview==1, m


**** Sexual behavior variables
*Ever had sex
gen weversex = 1 if v525>0 & v525!=. & v525!=98 & v525!=99
replace weversex = 0 if v525==0

gen meversex = 1 if mv525>0 & mv525!=. & v525!=98 & mv525!=99
replace meversex = 0 if mv525==0

gen eversex = weversex if sex==2
replace eversex = meversex if sex==1
replace eversex = 1 if (mv766b>0 & mv766b<98) | (v766b>0 & v766b<98)    /*Observation missing v525 but report having had sex in past year*/

*Age first sex
gen wage1sex = v525
replace wage1sex= v511 if v525==96
replace wage1sex= . if wage1sex==99 | wage1sex==98

gen wage1sex_cat= 1 if wage1sex==0
replace wage1sex_cat=2 if wage1sex<=15 & wage1sex>0
replace wage1sex_cat=3 if wage1sex>15 & wage1sex != .

gen mage1sex = mv525
replace mage1sex = mv511 if mv525==96
replace mage1sex= . if mage1sex==99 | mage1sex==98

gen mage1sex_cat = 1 if mage1sex==0
replace mage1sex_cat = 2 if mage1sex<=15 & mage1sex>0
replace mage1sex_cat = 3 if mage1sex>15 & mage1sex != .
replace mage1sex=. if mage1sex==99

gen age1sex_cat = mage1sex_cat if sex == 1
replace age1sex_cat = wage1sex_cat if sex == 2
tab age1sex_cat sex if interview==1, m

*Sex in last 12 months
gen wsex12month=1 if v766b>0 & v766b!=. & v766b!=98 & v766b!=99
replace wsex12month=0 if v766b==0

gen msex12month=1 if mv766b>0 & mv766b!=. & mv766b!=98 & mv766b!=99
replace msex12month=0 if mv766b==0

gen sex12month = wsex12month if sex==2
replace sex12month = msex12month if sex ==1
tab sex12month sex if interview==1, m

*Condom
gen wcondom = 0 if wsex12month==0
replace wcondom = 1 if v761==1 
replace wcondom = 0 if v761==0 

gen mcondom = 0 if msex12month==0
replace mcondom = 1 if mv761==1
replace mcondom = 0 if mv761==0 

gen condom = wcondom if sex ==2
replace condom = mcondom if sex==1
tab condom sex if interview==1, m

*High risk sex
gen whighhiv=1 if v766a>0 & v766a<98
replace whighhiv=0 if v766a==0

gen mhighhiv=1 if mv766a>0 & mv766a<98
replace mhighhiv=0 if mv766a==0

gen highhiv = whighhiv if sex==2
replace highhiv = mhighhiv if sex==1
tab highhiv sex if interview==1, m

*Number of partners
gen mpartner=mv766b if mv766b!=98
replace mpartner=2 if mv766b>=2 & mv766b<98

gen wpartner=v766b if v766b!=98
replace wpartner=2 if v766b>=2 & v766b<98

gen partner = wpartner if sex ==2
replace partner = mpartner if sex ==1
replace partner = . if partner==99
tab partner sex if interview==1, m

*****HIV knowledge and attitudes 

*Ever heard of HIV
gen knowhiv = 1 if v751==1 & sex == 2
replace knowhiv = 0 if v751==0 & sex == 2
replace knowhiv = 1 if mv751==1 & sex == 1
replace knowhiv = 0 if mv751==0 & sex == 1
tab knowhiv sex if interview==1, m

*Knows someone who has or died of AIDS
gen wknowsdiedofaids=v775
tab wknowsdiedofaids if sex==2 & interview==1, m
replace wknowsdiedofaids=. if wknowsdiedofaids==9
replace wknowsdiedofaids=0 if knowhiv==0 & wknowsdiedofaids==.   /*Never heard of HIV*/
replace wknowsdiedofaids=0 if v844==2 & wknowsdiedofaids==.      /*Don't know anyone with aids*/
replace wknowsdiedofaids=1 if (v844==1 | v845==1 | v846==1) & wknowsdiedofaids==.   /*Answer 1 to at least 1 question*/
tab wknowsdiedofaids if sex==2 & interview==1, m

gen mknowsdiedofaids=mv775
tab mknowsdiedofaids if sex==1 & interview==1, m
replace mknowsdiedofaids=. if mknowsdiedofaids==9
replace mknowsdiedofaids=0 if knowhiv==0 & mknowsdiedofaids==.   /*Never heard of HIV*/
replace mknowsdiedofaids=0 if mv844==2 & mknowsdiedofaids==.     /*Don't know anyone with aids*/
replace mknowsdiedofaids=1 if (mv844==1 | mv845==1 | mv846==1) & mknowsdiedofaids==.   /*Answer 1 to at least 1 question*/
tab mknowsdiedofaids if sex==1 & interview==1, m

gen knowsdiedofaids = wknowsdiedofaids if sex==2
replace knowsdiedofaids = mknowsdiedofaids if sex==1
tab knowsdiedofaids sex if interview==1, m

*Would care for relative with AIDS
*Note: People who respond "don't know" (v778==8) are assigned 0
gen waidscare = v778 
replace waidscare=. if waidscare==9 
replace waidscare=0 if waidscare==8

gen maidscare = mv778
replace maidscare=. if maidscare==9
replace maidscare=0 if maidscare==8

gen aidscare = waidscare if sex==2
replace aidscare = maidscare if sex==1
tab aidscare sex if interview==1, m
replace aidscare = 0 if knowhiv == 0 & aidscare==.
tab aidscare sex if interview==1, m

*Ever tested HIV
gen wevertestedHIV = 1 if v781 == 1 
replace wevertestedHIV =0 if v781 ==0
replace wevertestedHIV=1 if v840==1 & wevertestedHIV==.    /*ANC testing experience*/

gen mevertestedHIV=1 if mv781 ==1
replace mevertestedHIV=0 if mv781==0

gen evertestedHIV = wevertestedHIV if sex==2
replace evertestedHIV = mevertestedHIV if sex==1
bysort sex: tab knowhiv evertestedHIV if interview==1, m
replace evertestedHIV = 0 if evertestedHIV == . & knowhiv==0
tab evertestedHIV sex if interview==1, m

**Other variables

*Alcohol
gen walcohol= 1 if  s1012a==1 & sex==2 & universe==1
replace walcohol = 0 if s1012a==0 & sex==2 & universe==1

gen malcohol= 1 if sm812==1 & sex==1 & universe==1
replace malcohol = 0 if sm812==0 & sex==1 & universe==1 

gen alcohol = walcohol if sex ==2
replace alcohol = malcohol if sex ==1
tab alcohol sex if interview==1, m

*Smoking
gen wsmoke = 1 if  v463a ==1 | v463b ==1 | v463c ==1
replace wsmoke = 0 if v463a ==0 & v463b ==0 & v463c ==0

gen msmoke = 1 if  mv463a ==1 | mv463b ==1 | mv463c ==1
replace msmoke = 0 if mv463a ==0 & mv463b ==0 & mv463c ==0

gen smoke = wsmoke if sex ==2
replace smoke = msmoke if sex ==1
tab smoke sex if interview==1, m

*Ethnicity
gen ethnicity = v131
qui replace ethnicity=mv131 if ethnicity==.
tab ethnicity
forvalues i=1(1)67 {
count if ethnicity==`i' & universe==1
qui replace ethnicity=99 if ethnicity==`i' & r(N)<200    /*Group ethnicities represented by fewer than 200 individuals*/
}
tab ethnicity sex if interview==1, m

*Religion
gen religion = v130 if sex ==2
replace religion = mv130 if sex==1
replace religion = 3 if religion ==96
replace religion = . if religion == 99
tab religion sex if interview==1, m

*Language
gen mlanguage = smlangi 
replace mlanguage = 9 if mlanguage==6 | mlanguage==3 | mlanguage==5

gen wlanguage = slangi
replace wlanguage = 9 if wlanguage==6 | wlanguage==3 | wlanguage==5

gen language = mlanguage if sex == 1
replace language = wlanguage if sex == 2
tab language sex if interview==1, m


** Survey strata
egen stratum = group(region urban)

*************************************************************************************
****** SELECTION VARIABLES
*************************************************************************************

** Interviewer identity
**  - Assign dummy variables to each interviewer who conducted at least 50 interviews of a given type
**  - The four interview types are distinguished by sex of the respondent (men vs. women) and by data source (individual questionnaire vs. household questionnaire): 
**    men individual questionnaire, men household questionnaire, women individual questionnaire, women household questionnaire
	
*Individual interviewer
tab mv028 hivconsent
tab v028 hivconsent

gen interviewerID = mv028 if sex==1 & universe==1
replace interviewerID = v028 if sex==2 & universe==1
tab interviewerID sex
gen interviewerID_50w = interviewerID if sex==2
gen interviewerID_50m = interviewerID if sex==1

*Household interview
gen hhinterviewerID = hv018 if universe==1
gen hhinterviewerID_50w = hhinterviewerID if agecat5!=. & education!=. & wealthcat!=. & region!=. & location!=. & universe==1 & sex==2 & hivconsent!=.
gen hhinterviewerID_50m = hhinterviewerID if agecat5!=. & education!=. & wealthcat!=. & region!=. & location!=.  & universe==1 & sex==1 & hivconsent!=.

tab interviewerID
tab hhinterviewerID

**** Fieldworkers
*Men indivdiual questionnaire
forvalues i=1(1)999 {
 qui count if interviewerID_50m==`i'
 qui replace interviewerID_50m = 1000 if interviewerID_50m==`i' & r(N)< 50    /*Interviewers with less than 50 interviews are combined*/
}
*Women indivdiual questionnaire
forvalues i=1(1)999 {
 qui count if interviewerID_50w==`i'
 qui replace interviewerID_50w = 1000 if interviewerID_50w==`i' & r(N)< 50 
}
*Men household questionnaire
forvalues i=1(1)999 {
 qui count if hhinterviewerID_50m==`i'
 qui replace hhinterviewerID_50m = 1000 if hhinterviewerID_50m==`i' & r(N)< 50
}
*Women household questionnaire
forvalues i=1(1)999 {
 qui count if hhinterviewerID_50w==`i'
 qui replace hhinterviewerID_50w = 1000 if hhinterviewerID_50w==`i' & r(N)< 50
}

tab interviewerID_50w hivconsent
tab interviewerID_50m hivconsent
tab hhinterviewerID_50w hivconsent
tab hhinterviewerID_50m hivconsent

**Household (hh) visited on first day interview team visited cluster
* Generate earliest month in each cluster
sort hv001 hv006
bysort hv001: gen firstmonth = hv006[1]

* Generate day of successful hh visit 
gen day = hv006*30 + hv016 if firstmonth==4          /*firstmonth==4 is April*/
replace day = hv006*31 + hv016 if firstmonth==5
replace day = hv006*30 + hv016 if firstmonth==6
replace day = hv006*31 + hv016 if firstmonth==7
replace day = hv006*31 + hv016 if firstmonth==8
replace day = hv006*30 + hv016 if firstmonth==9
replace day = hv006*31 + hv016 if firstmonth==10

* Generate day rank position
su hv001
sort hv001 day
forvalues i=1(1)320  {
egen dayrank_`i' = group(hv001 day) if hv001==`i' 
}

gen dayrankoverall = dayrank_1
forvalues i =2(1)320 {
qui replace dayrankoverall = dayrank_`i' if dayrankoverall ==.
}
drop dayrank_*
bysort hv001: gen dayrankmax = dayrankoverall[_N]
gen dayrankrelative = dayrankmax - dayrankoverall + hv017

gen firstday = 1 if dayrankmax <= dayrankrelative 
replace firstday = 0 if dayrankmax>dayrankrelative


save "clean", replace


*************************************************************************************
*************************************************************************************
****** ANALYSIS
*************************************************************************************
*************************************************************************************
/*
The analytic stragey is to build up information on HIV status for every eligible indivdiual in the data set. The majority
of individals have observed HIV status and we can use those data directly. Those individuals with missing HIV status
data are missing those data for two main reasons: (1) they completed the individual questionnaire but refused to consent to an HIV test
or (2) they could not be contacted by the survey team or were contacted by the survey team but did refused to answer the in the 
indivudal questionnaire. For both groups, we estimate Heckman-type selection models and then predict HIV status based on the models.  
For individuals in group (1), the Heckman-typse selection model includes a larger set of covariates ("consent regressions") than for 
individuals in group (2) ("contact regressions"),  because for group (1) the data from both the individual and the household questionnaire 
are available, while for group (2) only the data from the household questionnaire is available. We combine the HIV status (whether observed 
or predicted) for these three groups and calculate a sampling weighted average to obtain national estimates of HIV prevalence.
*/


use "clean.dta", clear


*** Macros for observed covariates

*Men individual questionnaire
global ind_vars_m agecat5 education wealthcat location region marital  ///
	  std age1sex_cat highhiv partner condom aidscare knowsdiedofaids evertestedHIV ///
	  smoke alcohol religion ethnicity language 

global reg_ind_vars_m i.agecat5 education i.wealthcat i.location i.region i.marital  ///
	  i.std i.age1sex_cat i.highhiv i.partner i.condom i.aidscare i.knowsdiedofaids i.evertestedHIV ///
	  i.smoke i.alcohol i.religion i.ethnicity i.language 
*Note: one list of variable names; another list of the same variables using dummy variables where appropriate in the regression models

*Women individual questionnaire
global ind_vars_w agecat5 education wealthcat location region marital  ///
	  std age1sex_cat highhiv partner condom aidscare knowsdiedofaids evertestedHIV ///
	  smoke alcohol religion ethnicity language 

global reg_ind_vars_w i.agecat5 education i.wealthcat i.location i.region i.marital  ///
	  i.std i.age1sex_cat i.highhiv i.partner i.condom i.aidscare i.knowsdiedofaids i.evertestedHIV ///
	  i.smoke i.alcohol i.religion i.ethnicity i.language 
  
*Household questionnaire
global hh_vars agecat5 education wealthcat region location
global reg_hh_vars i.agecat5 education i.wealthcat i.region i.location



*** Samples

** Men, individual interview

*Respondents who are missing individual interview regression covariates
egen m_ind_nmiss = rmiss($ind_vars_m) if sex == 1

gen  tot_ind_m = 1 if universe==1 & interview==1 & sex==1 & hivconsent!=. ///
		& m_ind_nmiss==0 & interviewerID!=.
label var tot_ind_m "Individual interview sample: men, total"

gen tot_ind_m_hiv = 1 if tot_ind_m==1 & hiv!=.
label var tot_ind_m_hiv "Individual interview sample: men, HIV status available (consented)"

gen tot_ind_m_nonhiv = 1 if tot_ind_m==1 & hiv==.
label var tot_ind_m_nonhiv "Individual interview sample: men, HIV status NOT available (NO consent)"

tab tot_ind_m
tab tot_ind_m_hiv
tab tot_ind_m_nonhiv

** Women, individual interview

*Respondents who are missing individual interview regression covariates
egen w_ind_nmiss = rmiss($ind_vars_w) if sex == 2

gen tot_ind_w = 1 if universe==1 & interview==1 & sex==2 & hivconsent!=. ///
		& w_ind_nmiss==0 & interviewerID!=.
label var tot_ind_w "Individual interview sample: women, total"

gen tot_ind_w_hiv = 1 if tot_ind_w==1 & hiv!=.
label var tot_ind_w_hiv "Individual interview sample: women, HIV status available (consented)"

gen tot_ind_w_nonhiv = 1 if tot_ind_w==1 & hiv==.
label var tot_ind_w_nonhiv "Individual interview sample: women, HIV status NOT available (NO consent)"

tab tot_ind_w
tab tot_ind_w_hiv
tab tot_ind_w_nonhiv


** Men, household interview
egen hh_nmiss = rmiss($hh_vars)
gen tot_hh_m = 1 if universe==1 & sex==1 & hh_nmiss==0 & hivconsent!=. ///
		& hv103!=9 & firstday!=. & hhinterviewerID!=. 
label var tot_hh_m "Household interview sample: men, total"

gen tot_hh_m_hiv = 1 if tot_hh_m == 1 & hiv!=.
label var tot_hh_m_hiv "Household interview sample: men, HIV status available"

gen tot_hh_m_nonhiv_nonind = 1 if tot_hh_m == 1 & hiv==. & tot_ind_m != 1
label var tot_hh_m_nonhiv_nonind "Household interview sample: men, NO HIV status, NO interview"

tab tot_hh_m
tab tot_hh_m_hiv
tab tot_hh_m_nonhiv_nonind


** Women, household interview
gen tot_hh_w = 1 if universe==1 & sex==2 & hh_nmiss==0 & hivconsent!=. ///
		 & hv103!=9 & firstday!=. & hhinterviewerID!=.
label var tot_hh_w "Household interview sample: women, total"

gen tot_hh_w_hiv = 1 if tot_hh_w==1 & hiv!=.
label var tot_hh_w_hiv "Household interview sample: women, HIV status available"

gen tot_hh_w_nonhiv_nonind=1 if tot_hh_w==1 & hiv==. & tot_ind_w!=1 
label var tot_hh_w_nonhiv_nonind "Household interview sample: women, NO HIV status, NO interview"

tab tot_hh_w
tab tot_hh_w_hiv
tab tot_hh_w_nonhiv_nonind


*Prediction groups
gen predGroup = .
replace predGroup = 0 if (hiv==1 | hiv==0) & universe==1                             /*Individuals with observed HIV status*/
replace predGroup = 1 if tot_ind_m_nonhiv == 1 | tot_ind_w_nonhiv == 1               /*Individuals whose HIV status will be predicted with consent regression*/
replace predGroup = 2 if tot_hh_m_nonhiv_nonind == 1 | tot_hh_w_nonhiv_nonind == 1   /*Individuals whose HIV status will be predicted with contact regression*/

*Final regression samples tabbed against prediction groups (Valid, Consent, Contact)
tab predGroup sex if universe==1, m

*Consent regression men
tab predGroup tot_ind_m if universe==1 & sex==1, m
*Contact regression men
tab predGroup tot_hh_m if universe==1 & sex==1, m
*Consent regression women
tab predGroup tot_ind_w if universe==1 & sex==2, m
*Contact regression women
tab predGroup tot_hh_w if universe==1 & sex==2, m


*************************************************************************************
*************************************************************************************
****** ESTIMATION
*************************************************************************************
*************************************************************************************

*** Men

** Men consent

*Collapse collinear interviewers

replace interviewerID_50m = 1000 if interviewerID_50m ==703 | interviewerID_50m ==704 | ///			
  interviewerID_50m ==705 | interviewerID_50m ==706 | interviewerID_50m ==707 | interviewerID_50m ==708 
*Note: In the above step, we bin interviewers that were found to be highly collinear according to trace diagnostics.  
*The Heckman-type selection model regression results remain essentially unchanged if this step is eliminated.
*The same is true for the analogue steps for the other Heckman-type model regressions below (men contact, women constent, women contact). 

*Test selection variables (i.e., are the associated with consent?)
 probit hivconsent $reg_ind_vars_m i.interviewerID_50m, cluster(hv001)
 testparm i.interviewerID_50m

*Selection model
 heckprob hiv $reg_ind_vars_m, sel(hivconsent = $reg_ind_vars_m i.interviewerID_50m) cluster(hv001)
 
 *After running the model, make conditional prediction: prob(HIV+ | No_Participation) = prob(HIV+ and No_Participation) / prob(No_Participation)
 predict p10 if predGroup==1, p10
 predict p00 if predGroup==1, p00
 gen m_consent_p1_0 = p10/(p10+p00) if predGroup==1    /*The conditional prediction of HIV status among those without test*/
 su m_consent_p1_0 if predGroup==1   /*Predicted prevalence in those who refused consent*/
 drop p10 p00
 
** Men contact

*Collapse collinear interviewers
replace hhinterviewerID_50m = 1000 if hhinterviewerID_50m ==308 | hhinterviewerID_50m ==703 | hhinterviewerID_50m ==704 | ///
  hhinterviewerID_50m ==705 | hhinterviewerID_50m ==706 | hhinterviewerID_50m ==707 | hhinterviewerID_50m ==708

*Test selection variables
 probit hivconsent $reg_hh_vars firstday i.hhinterviewerID_50m, cluster(hv001)
 testparm i.hhinterviewerID_50m
 testparm firstday

*Selection model
 heckprob hiv $reg_hh_vars, sel(hivconsent = $reg_hh_vars firstday i.hhinterviewerID_50m) cluster(hv001)
 
 *Conditional prediction
 predict p10 if predGroup==2, p10
 predict p00 if predGroup==2, p00
 gen m_contact_p1_0 = p10/(p10+p00) if predGroup==2
 su m_contact_p1_0 if predGroup==2   /*Predicted prevalence in those who were not contacted*/
 drop p10 p00
 
 
*** Women

** Women consent

*Collapse collinear interviewers
 replace interviewerID_50w = 1000 if interviewerID_50w == 408 | interviewerID_50w == 906 | interviewerID_50w == 907 | interviewerID_50w == 908
 
*Test selection variables
 probit hivconsent $reg_ind_vars_w i.interviewerID_50w, cluster(hv001)
 testparm i.interviewerID_50w

*Selection model
 heckprob hiv $reg_ind_vars_w, sel(hivconsent = $reg_ind_vars_w i.interviewerID_50w) cluster(hv001)

 *Conditional prediction
 predict p10 if predGroup==1, p10
 predict p00 if predGroup==1, p00
 gen w_consent_p1_0 = p10/(p10+p00) if predGroup==1
 su w_consent_p1_0 if predGroup==1   /*Predicted prevalence in those who refused consent*/
 drop p10 p00


** Women contact 

*Collapse collinear interviewers
 replace hhinterviewerID_50w = 1000 if hhinterviewerID_50w == 308

*Test selection variables
 probit hivconsent $reg_hh_vars firstday i.hhinterviewerID_50w, cluster(hv001)
 testparm i.hhinterviewerID_50w
 testparm firstday

*Selection model
 heckprob hiv $reg_hh_vars, sel(hivconsent = $reg_hh_vars firstday i.hhinterviewerID_50w) cluster(hv001)

 *Conditional prediction
 predict p10 if predGroup==2, p10
 predict p00 if predGroup==2, p00
 gen w_contact_p1_0 = p10/(p10+p00) if predGroup==2
 su w_contact_p1_0 if predGroup==2   /*Predicted prevalence in those who were not contacted*/
 drop p10 p00
 
 
 
***National estimates of HIV prevalence

svyset hv001 [pweight=hhweight], strata(stratum)

**Men
replace hiv = m_consent_p1_0 if sex==1 & predGroup==1
replace hiv = m_contact_p1_0 if sex==1 & predGroup==2

svy: mean hiv, subpop(if sex==1)                  /*National estimate*/
svy: mean hiv, subpop(if sex==1 & predGroup==0)   /*Complete case*/
svy: mean hiv, subpop(if sex==1 & predGroup==1)   /*Those who refused consent*/
svy: mean hiv, subpop(if sex==1 & predGroup==2)   /*Those who were not contacted*/

**Women
replace hiv = w_consent_p1_0 if sex==2 & predGroup==1
replace hiv = w_contact_p1_0 if sex==2 & predGroup==2

svy: mean hiv, subpop(if sex==2)                  /*National estimate*/
svy: mean hiv, subpop(if sex==2 & predGroup==0)   /*Complete case*/
svy: mean hiv, subpop(if sex==2 & predGroup==1)   /*Those who refused consent*/
svy: mean hiv, subpop(if sex==2 & predGroup==2)   /*Those who were not contacted*/

log close
